In the next few subsections, we dive into the details of how we drew these conclusions.
3.2 Data pre-processing
To conduct deeper analysis on our data, we added the following columns to our dataset:
BMI: the Body Mass Index of each person, a continuous variable calculated using the World Health Organization guideline: \(BMI = \frac{Weight}{Height^2}\).
IsOverweight: a binary categorical value whose value is 1 if the BMI of the person is more than 25, 0 otherwise.
These columns serve as alternate metrics of the obesity of an individual, a continuous variable and a binary categorical variable, to complement NObeyesdad, the obesity category level which is a ordinal categorical variable.
These new columns can be found and used directly in the dataset scripts/xinyi-zhao_files/CleanObesityDataSet.csv.
In order to improve the visualization of certain plots, we also made local modifications to the data types of some columns for only that given plot. For instance, to plot the alluvial diagrams (figure 10), we rounded the values of the following: Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Physical activity frequency (FAF), Time using technology devices (TUE) (some of the values of these variables were floats, presumably because the researchers took an average of multiple time periods). This made sure that the variables are discrete (whole numbers).
3.3 Analysis of Height, Weight and BMI variables
Since the dataset contains artificially generated data, we first needed to verify that the dataset matches our expectations, notably by visualizing the distribution of these three continuous variables.
Code
# Importing the necessary libraries and datasetlibrary(readr)library(dplyr)library(tidyverse)library(stats)library(vcd)library(ggplot2)library(gridExtra)library(grid)library(GGally)library(reshape2)library(colorspace)library(plyr)library(ggalluvial) library(ggridges)data <-read_csv('scripts/xinyi-zhao_files/CleanObesityDataSet.csv',show_col_types =FALSE)
3.3.1 Verification of number of rows, columns, missing values
A preliminary verifications of the number of rows, columns and missing values of our dataset show that there does not seem to be any problems with the structure of our dataset.
Code
cat("Number of rows in dataset: ", nrow(data), "\nNumber of columns in dataset: ", ncol(data),"\nNumber of missing values in dataset: ", sum(is.na(data)))
Number of rows in dataset: 2111
Number of columns in dataset: 20
Number of missing values in dataset: 0
3.3.2 Histogram
Additionally, the data seems to align with our expectations, visualizing the Height, Weight and BMI variables with a histogram (Fig. 1) shows that at first glance there is no anomalous data that stands out with respect to the people surveyed.
Code
# Histograms of Height, Weight and BMIplot1 <-ggplot(data = data, aes(x = Height)) +geom_histogram(binwidth =0.02)plot2 <-ggplot(data = data, aes(x = Weight)) +geom_histogram(binwidth =2)plot3 <-ggplot(data = data, aes(x = BMI)) +geom_histogram(binwidth =1)grid.arrange(plot1, plot2, plot3, ncol =3,top =textGrob("Histograms of the continuous variables Height, Weight and BMI",gp=gpar(fontsize=16)))
Fig. 1: Histograms of Height, Weight and BMI
3.3.3 Normality tests (Shapiro-Wilk test and QQ-plots)
We investigated whether three variables plotted in Fig. 1 follow a normal distribution by running the Shapiro–Wilk test (below) and plotting the QQ-plots of all three variables (Fig. 2).
Code
shapiro.test(data$Height)
Shapiro-Wilk normality test
data: data$Height
W = 0.99323, p-value = 2.772e-08
Code
shapiro.test(data$Weight)
Shapiro-Wilk normality test
data: data$Weight
W = 0.9765, p-value < 2.2e-16
Code
shapiro.test(data$BMI)
Shapiro-Wilk normality test
data: data$BMI
W = 0.97475, p-value < 2.2e-16
On the histogram, it seems like Weight does not follow a normal distribution as there are multiple local peaks at around 75kg, 90kg and 100kg), and neither does BMI since there is nobody whose BMI is less than 18 or higher than 50, hence the BMI distribution does not have a tail.
The height of the individuals does not appear to follow a normal distribution, as it appears to be multimodal (1.75m and 1.8m) and the Shapiro-Wilk normality test gives a p-value of 2e-8 << 0.05. This is not quite aligned with our expectations as height has been found by researchers to in general follow a normal distribution.
The weight of the individuals appears to have a slight right skew according to the QQ Plot (the plot “climbs slowly” for smaller values of the x axis). This is aligned with our expectations as body weight is found to have a right skewaccording to research.
The BMI of individuals appears to be in between a uniform distribution and a normal distribution from the QQ Plot (flat for small and big values on the theoretical normal quantiles, but not flat enough to be a true uniform distribution). This is not collaborated by other research that has found that the distribution of BMI is right-skewed, for instance by Briese et. al (2011). Yet, this is to be expected since the researchers artificially generated more high BMI data points to balance out the class imbalance.
3.3.4 Boxplot of BMI
Afterwards, we plotted a boxplot (Fig. 3) to better understand the other statistics (e.g. median, outliers) of the variable BMI.
Code
# Boxplot to visualize the BMI across all individuals ggplot(data, aes(x=BMI)) +geom_boxplot(fill ="lightblue", color ="black") +coord_cartesian(ylim =c(-2, 2)) +labs(title ="Boxplot of BMI across all individuals")
Fig. 3: Boxplot to visualize the BMI across all individuals
Code
fivenum(data$BMI)
[1] 12.99868 24.32580 28.71909 36.01650 50.81175
The Tukey Five-Number summary tells us the following: - Minimum BMI is 13
Lower hinge is 24
Median is 28.7
Upper hinge is 36.0
Maximum BMI is 50.8
Since the median is at 28.7 > 25, more than half of the sample corresponds to overweight individuals (generated or real people). Since the lower hinge is at 24 (normal weight), we also know that less than 25% of the individuals in the dataset are underweight, and also less than 25% of the individuals in the dataset are of normal weight.
Therefore, this dataset has much more data of overweight people than expected, which makes sense since the obesity level category divides “overweight” into five different categories (Overweight I & II, Obesity I, II, III).
3.3.5 Ridgeline plot of weight
Lastly, we made a ridgeline plot of weight of individuals across Obesity Types (Fig. 4).
Code
ggplot(data, aes(x = Weight, y = NObeyesdad, fill = NObeyesdad)) +stat_density_ridges(alpha =0.7) +scale_fill_viridis_d() +labs(title ="Ridgeline Plot of Obesity Types", x ="weight", y ="Obesity Type") +theme_ridges()
Fig. 4: Ridgeline Plot of weight of individuals across Obesity Types
3.4 Preliminary visualizations
Code
set.seed(123456)data <- data %>%mutate_if(is.numeric, scale)data_num <- data[c("Age", "Height", "Weight", "NCP", "CH2O", "FAF", "TUE","BMI")]# Convert all columns to numericdata_num <-sapply(data_num, as.numeric)# Sample 1000 points for clearer visualizationdata_sampled <-sample_n(as.data.frame(data_num), 300)ggpairs(as.data.frame(data_sampled), aes(alpha =0.4, size =0.02), upper =list(continuous =wrap("cor", size =12))) +theme(text =element_text(size =35))
Fig. 5: Scatterplot of each pair of numeric variables in the dataset
Code
categories <-c('Insufficient_Weight','Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II','Obesity_Type_III')# Create a mapping from category to integercategory_mapping <-setNames(1:length(categories), categories)data$NObesity_numeric <-as.integer(factor(data$NObeyesdad, levels = categories))data_pca <- data[c("FCVC","NCP","FAF","TUE","NObesity_numeric")]cor_matrix <-cor(data_pca)# Reshape the data for heatmapmelted_cor <-melt(cor_matrix)# Creating heatmapggplot(melted_cor, aes(Var1, Var2, fill = value)) +geom_tile() +geom_text(aes(label =ifelse(value >0.1&value<1, round(value, 2), '')), color='darkblue',vjust =1) +scale_fill_gradient2(midpoint =0) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +ggtitle("Heatmap of Correlations")
Fig. 6: Heatmap of correlations between numerical variables
Code
# Creating individual density plots with smooth linesplot_fcvc <-ggplot(data, aes(x = FCVC)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("FCVC Density Plot") +xlab("FCVC") +ylab("Density")plot_ncp <-ggplot(data, aes(x = NCP)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("NCP Density Plot") +xlab("NCP") +ylab("Density")plot_faf <-ggplot(data, aes(x = FAF)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("FAF Density Plot") +xlab("FAF") +ylab("Density")plot_tue <-ggplot(data, aes(x = TUE)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("TUE Density Plot") +xlab("TUE") +ylab("Density")# Arrange the plots in a 2x2 gridgrid.arrange(plot_fcvc, plot_ncp, plot_faf, plot_tue, nrow =2)
Fig. 7: Density plots for Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Physical activity frequency (FAF), Time using technology devices (TUE)
3.5 Understanding the correlation between biological factors and obesity
3.6 Correlation between lifestyle habits and obesity
Code
# Create a mosaic plot for each variable in the table plotted against 'IsOverweight'data_mosaic <- data[c("Gender", "family_history_with_overweight","FAVC","SMOKE","SCC","IsOverweight")]data_mosaic$family_history_with_overweight <-fct_relevel(data_mosaic$family_history_with_overweight, "yes", "no")data_mosaic$SMOKE <-fct_relevel(data_mosaic$SMOKE, "yes", "no")data_mosaic$FAVC <-fct_relevel(data_mosaic$FAVC, "yes", "no")data_mosaic$SCC <-fct_relevel(data_mosaic$SCC, "yes", "no")data_mosaic$IsOverweight <-as.character(data_mosaic$IsOverweight)for(var innames(data_mosaic)[names(data_mosaic) !="IsOverweight"]) {# png(paste0("mosaic_", var, "_IsOverweight.png")) contingency_table <-table(data_mosaic[[var]],data_mosaic$IsOverweight)mosaicplot(contingency_table, main =paste("Mosaic Plot for", var), ylab ="Is Overweight", xlab = var, col =c("pink", "brown"))# dev.off()}
Fig. 8a: Mosaic Plot of Gender against IsOverweight
Fig. 8b: Mosaic Plot of Family History against IsOverweight
Fig. 8c: Mosaic Plot of Frequent consumption of high caloric food against IsOverweight
Fig. 8d: Mosaic Plot of weight of whether the person smokes against IsOverweight
Fig. 8e: Mosaic Plot of calories consumption monitoring against IsOverweight